Abstract: This paper addresses sequential labeling of signals, that is,
discrimination for temporal samples. In this context, we propose a learning
method for a large-margin filtering that best separates the classes. We thus
jointly learn an SVM on samples and a temporal filtering of these samples.
This method enables online labeling of temporal samples. An optimal offline
sequence decoding using the Viterbi algorithm is also proposed. We introduce
several regularization terms that weight or select the channels automatically
in the sense of the large-margin criterion. Finally, our approach is tested on
a toy example of non-linear signals as well as on real Brain-Computer Interface
data. These experiments show the benefit of supervised learning of a temporal
filtering for sequence labeling.
Keywords: SVM, sequence labeling, filtering



1 Introduction 

Signal sequence labeling is a classical machine learning problem that typically arises
in Automatic Speech Recognition (ASR) or Brain Computer Interfaces (BCI). The idea
is to assign a label to every sample of a signal while taking into account the
sequentiality of the samples. For instance, in speaker diarization, the aim is to recognize
which speaker is talking over time. Another example is the recognition of mental states from
Electro-Encephalographic (EEG) signals. These mental states are then mapped into
commands for a computer (virtual keyboard, mouse) or a mobile robot, hence the need for
sample labeling (Blankertz et al., 2004; Millán, 2004).



One widely used approach for performing sequence labeling is Hidden Markov Models
(HMMs), cf. (Cappé et al., 2005). HMMs are probabilistic models that may be
used for sequence decoding of discrete state observations. In the case of continuous
observations such as signal samples or vectorial features extracted from the signal,
Continuous Density HMMs are considered. When using HMMs for sequence decoding,
one needs the conditional probability of the observations per hidden state
(class), which is usually obtained through Gaussian Mixtures (GM) (Cappé et al.,
2005). But this kind of model performs poorly in high dimensional spaces in terms
of discrimination, and recent works have shown that the decoding accuracy may be
improved by using discriminative models (Sloin & Burshtein, 2008). One simple
approach for using discriminative classifiers in the HMM framework has been proposed
by Ganapathiraju et al. (2004). It consists in learning SVM classifiers, known for their
better robustness in high dimension, and in transforming their outputs into probabilities using
Platt's method (Lin et al., 2007), leading to better performances after Viterbi decoding.



However, this approach supposes that the complete sequence of observations is
available, which corresponds to an offline decoding. In the case of BCI applications, a
real-time decision is often needed (Blankertz et al., 2004; Millán, 2004), which restricts the
use of the Viterbi decoding.



Another limit of HMMs is that they cannot take into account a time-lag between the
labels and the discriminative features. Indeed, in this case some of the learning
observations are mislabeled, leading to a biased density estimation per class. This is a problem
in BCI applications where the interesting information is not always synchronized with
the labels. For instance, Pistohl et al. (2008) showed the need of applying delays to the
signal, since the neuronal activity precedes the actual movement. Note that they
selected the delay through validation. Another illustration of the need for automated
time-lag handling is the following. Suppose we want to interact with a computer using
multi-modal acquisitions (EEG, EMG, ...). Then, since each modality has its own time-lag
with respect to neural activity as shown by Salenius et al. (1996), it may be difficult to
manually synchronize all modalities, and better adaptation can be obtained by learning
the "best" time-lag to apply to each modality channel.



Furthermore, instead of using a fixed filter as a preprocessing stage for signal
denoising, learning the filter may help in adapting to the noise characteristics of each channel
in addition to the time-lag adjustment. In such a context, Flamary et al. (2010)
proposed a method to learn a large margin filtering for linear SVM classification of samples
(FilterSVM). They learn a Finite Impulse Response (FIR) filter for each channel of the
signal jointly with a linear classifier. Such an approach has the flavor of the Common
Sparse-Spatio-Spectral Pattern (CSSSP) of Dornhege et al. (2006) as it corresponds to
a filter which helps in discriminating classes. However, CSSSP is a supervised feature
extraction method based on time-windows, whereas FilterSVM is a sequential sample
classification method. Moreover, the unique temporal filter provided by CSSSP cannot
adapt to different channel properties, unlike FilterSVM, which learns one filter
per channel.



In this paper, we extend the work of Flamary et al. (2010) to the non-linear case.
We propose algorithms that may be used to obtain large margin filtering in non-linear
problems. Moreover, we study and discuss the effect of different regularizers for the
filtering matrix. Finally, in the experimental section we test our approach on a toy example
for online and offline decision (with a Viterbi decoding) and investigate the parameter
sensitivity of our method. We also benchmark our approach in an online sequence
labeling situation by means of a BCI problem.
2 Sample Labeling

First we define the problem of sample labeling and the filtering of a multi-dimensional
signal. Then we define the SVM classifier for filtered samples.



2.1 Problem definition

We want to obtain a sequence of labels from a multi-channel signal or from
multi-channel features extracted from that signal. We suppose that the training samples are
gathered in a matrix $X \in \mathbb{R}^{n \times d}$ containing $d$ channels and $n$ samples. $X_{i,v}$ is the value
of channel $v$ for the $i$-th sample. The vector $y \in \{-1, 1\}^n$ contains the class of each
sample.

In order to reduce noise in the samples or variability in the features, a usual approach
is to filter $X$ before the classifier learning stage. In the literature, all channels are usually
filtered with the same filter (Pistohl et al. (2008) used a Savitzky-Golay filter) although
there is no reason for a single filter to be optimal for all channels. Let us define the
filter applied to $X$ by the matrix $F \in \mathbb{R}^{f \times d}$. Each column of $F$ is a filter for the
corresponding channel in $X$ and $f$ is the size of the filters.
We define the filtered data matrix $\tilde{X}$ by:

$$\tilde{X}_{i,v} = \sum_{u=1}^{f} F_{u,v}\, X_{i+1-u+n_0,\,v} \tag{1}$$

where the sum is a unidimensional convolution of each channel by the filter in the
appropriate column of $F$. $n_0$ is the delay of the filter; for instance $n_0 = 0$ corresponds
to a causal filter and $n_0 = f/2$ corresponds to a filter centered on the current sample.
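As an illustration of the filtering defined above, the following minimal NumPy sketch applies one FIR filter per channel with a delay $n_0$. The function name `apply_filter` and the zero-padding of samples outside the signal are assumptions of this sketch; the text does not specify how the borders are handled.

```python
import numpy as np

def apply_filter(X, F, n0=0):
    """Filter each channel of X (n x d) with the matching column of F (f x d).

    Computes X_tilde[i, v] = sum_{u=1..f} F[u, v] * X[i + 1 - u + n0, v]
    (1-based indices as in the text). Samples outside the signal are taken
    as 0, which is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    F = np.asarray(F, dtype=float)
    n, d = X.shape
    f = F.shape[0]
    X_tilde = np.zeros((n, d))
    for u in range(f):              # u is the 0-based tap index (u + 1 in the text)
        shift = u - n0              # output row i reads input row i - shift
        lo, hi = max(0, shift), min(n, n + shift)
        if lo < hi:
            X_tilde[lo:hi] += F[u] * X[lo - shift:hi - shift]
    return X_tilde
```

With `F = np.full((f, d), 1.0 / f)` this reduces to a moving average over the last `f` samples of every channel (for `n0 = 0`).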



2.2 SVM for filtered samples 

A good way of improving the classification rate is to filter the channels in $X$ in order
to reduce the impact of the noise. The simplest filter in the case of high frequency noise
is the average filter defined by $F_{u,v} = 1/f$, $\forall u \in \{1, \dots, f\}$ and $v \in \{1, \dots, d\}$. $n_0$ is
selected depending on the problem at hand: $n_0 = 0$ for a causal filtering or $n_0 > 0$ for
a non-causal filtering. In the following, using an average filter as preprocessing on the
signal and an SVM classifier will be called Avg-SVM.

Once the filtering is chosen we can learn an SVM sample classifier on the filtered
samples by solving the problem:

$$\min_{g} \; \frac{1}{2}\|g\|^2 + \frac{C}{n}\sum_{i=1}^{n} H(y_i, \tilde{X}_{i,\cdot}, g) \tag{2}$$

where $C$ is the regularization parameter, $g(\cdot)$ is the decision function and $H(y, x, g) =
\max(0, 1 - y \cdot g(x))$ is the hinge loss. In practice, for the non-linear case, one solves the dual
form of this problem wrt. $g$:

$$\max_{\alpha} J_{SVM}(\alpha, F) = \max_{\alpha} \; -\frac{1}{2}\sum_{i,j}^{n,n} y_i y_j \alpha_i \alpha_j \tilde{K}_{i,j} + \sum_{i}^{n} \alpha_i \tag{3}$$

$$\text{s.t.} \quad \frac{C}{n} \geq \alpha_i \geq 0 \;\; \forall i \quad \text{and} \quad \sum_{i}^{n} \alpha_i y_i = 0$$

where $\forall i \in [1, n]$, $\alpha_i \in \mathbb{R}$ are the dual variables and $\tilde{K}$ is the kernel matrix for filtered
samples in the Gaussian case. When $\sigma_k$ is the kernel bandwidth, $\tilde{K}$ is defined by:

$$\tilde{K}_{i,j} = k(\tilde{X}_{i,\cdot}, \tilde{X}_{j,\cdot}) = \exp\left(-\frac{\|\tilde{X}_{i,\cdot} - \tilde{X}_{j,\cdot}\|^2}{2\sigma_k^2}\right) \tag{4}$$

Note that for any FIR filter, the resulting $\tilde{K}$ matrix is always positive definite if $k(\cdot, \cdot)$ is
positive definite. Indeed, suppose $k(\cdot, \cdot)$ is a kernel from $\mathcal{X}^2$ to $\mathbb{R}$ and $\phi$ a mapping from
any $\mathcal{X}'$ to $\mathcal{X}$; then $k'(\cdot, \cdot) = k(\phi(\cdot), \phi(\cdot))$ is a positive definite kernel. Here, our filter
is a linear combination of $\mathbb{R}^d$ elements, which is still in $\mathbb{R}^d$.

Once the classifier is learned, the decision function for a new filtered signal $\tilde{X}te$ at
sample $i$ is:

$$g(i, \tilde{X}te) = \sum_{j=1}^{n} \alpha_j y_j \, k(\tilde{X}te_{i,\cdot}, \tilde{X}_{j,\cdot}) \tag{5}$$
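The Gaussian kernel on filtered samples and the resulting decision function can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are already-filtered matrices with one sample per row, and the dual variables `alpha` are assumed to come from any SVM solver. No bias term is added, since the decision function in the text has none.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_k):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma_k^2)) for row-wise samples."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, clipped at 0 for round-off.
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma_k**2))

def decision_values(Xte_tilde, Xtr_tilde, y, alpha, sigma_k):
    """g(i) = sum_j alpha_j * y_j * k(Xte_tilde[i], Xtr_tilde[j])."""
    K = gaussian_kernel(Xte_tilde, Xtr_tilde, sigma_k)
    return K @ (np.asarray(alpha, dtype=float) * np.asarray(y, dtype=float))
```

Only the rows of `Xtr_tilde` with non-zero `alpha` contribute, so in practice the sum can be restricted to the support vectors.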

We show in the experiment section that this approach leads to an improvement over the
usual non-filtered approach. But the method relies on the choice of a filter depending
on prior information or user knowledge, and there is no evidence that the user-selected
filter will be optimal in any sense for a given classification task.

3 Large Margin Filtering for non-linear problems (KF-SVM)

We propose in this section to jointly learn the filtering matrix $F$ and the classifier;
this method will be named KF-SVM in the following. It leads to a filter maximizing the
margin between the classes in the feature space. The problem we want to solve is:

$$\min_{g, F} \; \frac{1}{2}\|g\|^2 + \frac{C}{n}\sum_{i=1}^{n} H(y_i, \tilde{X}_{i,\cdot}, g) + \lambda\,\Omega(F) \tag{6}$$

with $\lambda$ a regularization parameter and $\Omega(\cdot)$ a differentiable regularization function of $F$.
We can recognize in the left part of Equation (6) an SVM problem for filtered samples
$\tilde{X}$ but with $F$ as a variable. This objective function is non-convex. However, for a fixed
$F$, the optimization problem wrt. $g(\cdot)$ is convex and boils down to an SVM problem. So
we propose to solve Equation (6) by a coordinate-wise approach:

$$\min_{F} J(F) = \min_{F} \; J'(F) + \lambda\,\Omega(F) \tag{7}$$
with:

$$J'(F) = \min_{g} \; \frac{1}{2}\|g\|^2 + \frac{C}{n}\sum_{i=1}^{n} H(y_i, \tilde{X}_{i,\cdot}, g) \tag{8}$$

$$\phantom{J'(F)} = \max_{\alpha} J_{SVM}(\alpha, F) \tag{9}$$

where $J_{SVM}$ is defined in Equation (3) and $g(\cdot)$ is defined in Equation (5). Due to the
strong duality of the SVM problem, $J'(\cdot)$ can be expressed in its primal or dual form
(see (8) and (9)). The objective function $J$ defined in Equation (7) is non-convex. But
according to Bonnans & Shapiro (1998), for a given $F^*$, $J'(\cdot)$ is differentiable wrt. $F$.
At the point $F^*$, the gradient of $J(\cdot)$ can be computed. Finally we can solve the problem
in Equation (7) by doing a gradient descent on $J(F)$ along $F$.

Note that due to the non-convexity of the objective functions, problems (6) and (7) are
not strictly equivalent. But it is advantageous to solve (7) because it can be solved using
SVM solvers and our method would benefit from any improvement in this domain.



3.1 KF-SVM Solver and complexity 

For solving the optimization problem, we propose a conjugate gradient (CG) descent
algorithm along $F$ with a line search method for finding the optimal step. The method
is detailed in Algorithm 1, where $\beta$ is the CG update parameter and $D_F^i$ the descent
direction for the $i$-th iteration. For the experimental results we used the $\beta$ proposed by
Fletcher and Reeves, see (Hager & Zhang, 2006) for more information. The iterations
in the algorithm may be stopped by two stopping criteria: a threshold on the relative
variation of $J(F)$ or on the norm of the variation of $F$.

Algorithm 1 KF-SVM solver

  Set $F_{u,v} = 1/f$ for $v = 1 \dots d$ and $u = 1 \dots f$
  Set $i = 0$, set $D_F^0 = 0$
  repeat
    $i = i + 1$
    $G_F^i \leftarrow$ gradient of $J'(F) + \lambda\Omega(F)$ wrt. $F$
    $\beta \leftarrow \|G_F^i\|^2 / \|G_F^{i-1}\|^2$ (Fletcher and Reeves)
    $D_F^i \leftarrow -G_F^i + \beta D_F^{i-1}$
    $(F^i, \alpha^*) \leftarrow$ Line-Search along $D_F^i$
  until stopping criterion is reached
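The structure of Algorithm 1 can be illustrated by a generic Fletcher-Reeves conjugate gradient loop. The sketch below is not the authors' solver. It takes the objective and its gradient as arbitrary callables (for KF-SVM, evaluating the objective would require solving an SVM). It uses a backtracking Armijo line search rather than an optimal step, sets $\beta = 0$ at the first iteration, and restarts along the negative gradient when the direction is not a descent direction. All of these choices are assumptions made for illustration.

```python
import numpy as np

def fletcher_reeves_descent(objective, gradient, F0, max_iter=200, tol=1e-10):
    """Minimize objective(F) by nonlinear conjugate gradient (Fletcher-Reeves)."""
    F = np.array(F0, dtype=float)
    D = np.zeros_like(F)                  # D^0 = 0
    g_prev_sq = None                      # ||G^{i-1}||^2, undefined at first iteration
    J = objective(F)
    for _ in range(max_iter):
        G = gradient(F)
        g_sq = float(np.sum(G * G))
        if g_sq == 0.0:                   # stationary point reached
            break
        beta = 0.0 if g_prev_sq is None else g_sq / g_prev_sq
        D = -G + beta * D
        slope = float(np.sum(G * D))
        if slope >= 0.0:                  # not a descent direction: restart along -G
            D, slope = -G, -g_sq
        # Backtracking (Armijo) line search along D.
        step = 1.0
        while step > 1e-12 and objective(F + step * D) > J + 1e-4 * step * slope:
            step *= 0.5
        F_new = F + step * D
        J_new = objective(F_new)
        g_prev_sq = g_sq
        converged = abs(J - J_new) <= tol * max(1.0, abs(J))
        F, J = F_new, J_new
        if converged:                     # relative variation of J below threshold
            break
    return F
```

The stopping rule used here is the relative variation of the objective, one of the two criteria mentioned in the text.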



Note that for each computation of $J(F)$ in the line search, the optimal $\alpha^*$ is found
by solving an SVM. A similar approach has been used to solve the Multiple-Kernel
problem in (Rakotomamonjy et al., 2008), where the weights of the kernels are learned
by gradient descent and the SVM is solved iteratively.
At each iteration of the algorithm the gradient of $J'(F) + \lambda\Omega(F)$ has to be computed.
With a Gaussian kernel the gradient of $J'(\cdot)$ wrt. $F$ is

$$\nabla J'(F_{u,v}) = \frac{1}{2\sigma_k^2}\sum_{i,j}^{n,n} (X_{i+1-u,v} - X_{j+1-u,v})(\tilde{X}_{i,v} - \tilde{X}_{j,v})\,\tilde{K}_{i,j}\, y_i y_j \alpha_i^* \alpha_j^* \tag{10}$$

where $\alpha^*$ is the SVM solution for a fixed $F$ (the filter delay $n_0$ is omitted from the
indices for readability). We can see that the complexity of this
gradient is $O(n^2.f^2)$ but in practice, SVMs have a sparse support vector representation.
So in fact the gradient computation is $O(n_s^2.f^2)$ with $n_s$ the number of support vectors
selected.
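The gradient with respect to the filter coefficients can be checked numerically. The sketch below treats the dual variables as fixed, evaluates the dual objective $J_{SVM}(\alpha, F)$ for a Gaussian kernel, and computes its derivative with respect to each $F_{u,v}$. The helper names, the zero-padding at the signal borders and the dense $O(n^2)$ pairwise arrays are assumptions made for readability, not an efficient implementation.

```python
import numpy as np

def apply_filter(X, F, n0=0):
    """Per-channel FIR filtering with delay n0 and zero-padded borders."""
    n, f = X.shape[0], F.shape[0]
    X_tilde = np.zeros_like(X, dtype=float)
    for u in range(f):
        s = u - n0
        lo, hi = max(0, s), min(n, n + s)
        if lo < hi:
            X_tilde[lo:hi] += F[u] * X[lo - s:hi - s]
    return X_tilde

def delayed(X, u, n0=0):
    """Signal seen by tap u (0-based): the derivative of X_tilde wrt. F[u, :]."""
    unit = np.zeros((u + 1, X.shape[1]))
    unit[u] = 1.0
    return apply_filter(X, unit, n0)

def dual_objective(F, X, y, alpha, sigma_k, n0=0):
    """J_SVM(alpha, F) = -1/2 sum_ij y_i y_j a_i a_j K_ij + sum_i a_i."""
    Xt = apply_filter(X, F, n0)
    sq = ((Xt[:, None, :] - Xt[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2.0 * sigma_k**2))
    ya = y * alpha
    return float(-0.5 * ya @ K @ ya + alpha.sum())

def dual_gradient(F, X, y, alpha, sigma_k, n0=0):
    """Gradient of dual_objective wrt. F for fixed alpha (Gaussian kernel)."""
    Xt = apply_filter(X, F, n0)
    diff_t = Xt[:, None, :] - Xt[None, :, :]             # filtered differences (n, n, d)
    K = np.exp(-(diff_t**2).sum(axis=-1) / (2.0 * sigma_k**2))
    W = K * np.outer(y * alpha, y * alpha)               # K_ij y_i y_j a_i a_j
    grad = np.zeros_like(F, dtype=float)
    for u in range(F.shape[0]):
        Xs = delayed(X, u, n0)
        diff_s = Xs[:, None, :] - Xs[None, :, :]         # raw delayed differences
        grad[u] = (W[:, :, None] * diff_s * diff_t).sum(axis=(0, 1)) / (2.0 * sigma_k**2)
    return grad
```

Comparing `dual_gradient` against central finite differences of `dual_objective` is a quick way to validate the sign and the scaling factor of the analytic expression.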

Due to the non-convexity of the objective function, it is difficult to provide an exact
evaluation of the solution complexity. However, we know that the gradient computation
is $O(n_s^2.f^2)$ and that when $J(F)$ is computed in the line search, an SVM of size $n$ is
solved and an $O(n.f.d)$ filtering is applied. Note that a warm-start trick is used when
calling the SVM solver iteratively in order to speed up the method.



3.2 Filter regularization 

In this section we discuss the choice of the filter regularization term. This choice 
is important due to the complexity of the KF-SVM model. Indeed, learning the FIR 
filters adds parameters to the problem and regularization is essential in order to avoid 
over-fitting. 

The first regularization term that we consider and use in our KF-SVM framework is
the Frobenius norm:

$$\Omega_2(F) = \sum_{u,v}^{f,d} F_{u,v}^2 \tag{11}$$

This regularization term is differentiable and the gradient is easy to compute.
Minimizing this regularization term corresponds to minimizing the filter energy. In terms
of classification, the filter matrix can be seen as a kernel parameter weighting delayed
samples. For a given column, such a sequential weighting is related to a phase/delay
and cut-off frequency of the filter. Moreover the Gaussian kernel defined in Equation (4)
shows that the per-column convolution can be seen as a scaling of the channels prior to
kernel computation. The intuition of how this regularization term influences the filter
learning is the following. Suppose we learn our decision function $g(\cdot)$ by minimizing
only $J'(\cdot)$; the learned filter matrix will maximize the margin between classes. Adding
the Frobenius regularizer will force non-discriminative filter coefficients to vanish, thus
reducing the impact of some delayed samples on the kernel.

Using this regularizer, all filter coefficients are treated independently, and even if it
tends to down-weight some non-relevant channels, filter coefficients are not sparse. If
we want to perform a channel selection while learning the filter $F$, we have to force
some columns of $F$ to be zero. For that, we can use an $\ell_1 - \ell_2$ mixed-norm as a
regularizer:

$$\Omega_{1-2}(F) = \sum_{v}^{d}\left(\sum_{u}^{f} F_{u,v}^2\right)^{\frac{1}{2}} = \sum_{v}^{d} h\left(\|F_{\cdot,v}\|^2\right) \tag{12}$$

with $h(x) = x^{\frac{1}{2}}$ the square root function. Such a mixed-norm acts as an $\ell_2$ norm on each
single channel filter while the $\ell_1$ norm on each channel filter energy will tend to drive
all coefficients related to a channel to zero. As this regularization term is not differentiable,
the solver proposed in Algorithm 1 cannot be used. We address the problem through
a Majorization-Minimization algorithm (Hunter & Lange, 2004) that enables us to take
advantage of the KF-SVM solver proposed above. The idea here is to iteratively replace
$h(\cdot)$ by a majorization and to minimize the resulting objective function. Since $h(\cdot)$ is
concave on the positive orthant, we consider the following linear majorization of $h(\cdot)$ at
a given point $x_0 > 0$:

$$\forall x > 0, \quad h(x) \leq x_0^{\frac{1}{2}} + \frac{1}{2}\,x_0^{-\frac{1}{2}}\,(x - x_0)$$

The main advantage of a linear majorization is that we can re-use the KF-SVM algorithm.
Indeed, at iteration $k+1$, for $F^{(k)}$ the solution at iteration $k$, applying this linear
majorization of $h(\|F_{\cdot,v}\|^2)$ around $\|F^{(k)}_{\cdot,v}\|^2$ yields a Majorization-Minimization algorithm
for sparse filter learning which consists in iteratively solving:

$$\min_{F^{(k+1)}} \; J'(F) + \lambda\,\Omega_d(F) \tag{13}$$

$$\text{with} \quad \Omega_d(F) = \sum_{v}^{d} d_v \sum_{u}^{f} F_{u,v}^2 \quad \text{and} \quad d_v = \frac{1}{2\,\|F^{(k)}_{\cdot,v}\|}$$

$\Omega_d$ is a weighted Frobenius norm; this regularization term is differentiable and the
KF-SVM solver can be used. We call this method Sparse KF-SVM (SKF-SVM) and we use
here similar stopping criteria as in Algorithm 1.
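The regularizers discussed in this section are simple to write down. The sketch below gives the Frobenius term, the $\ell_1 - \ell_2$ mixed norm, and the weights of the weighted Frobenius term obtained from the linear majorization of the square root. The function names and the small floor `eps`, which guards against columns that are exactly zero, are assumptions of this illustration.

```python
import numpy as np

def frobenius_reg(F):
    """Sum of squared filter coefficients; returns (value, gradient)."""
    return float(np.sum(F**2)), 2.0 * F

def mixed_norm_reg(F):
    """l1-l2 mixed norm: sum over channels of the l2 norm of each column of F."""
    return float(np.sum(np.sqrt(np.sum(F**2, axis=0))))

def mm_weights(F_k, eps=1e-12):
    """Weights d_v = 1 / (2 ||F_k[:, v]||) from the majorization at F_k.

    eps is a hypothetical safeguard for all-zero columns.
    """
    norms = np.sqrt(np.sum(F_k**2, axis=0))
    return 1.0 / (2.0 * np.maximum(norms, eps))

def weighted_frobenius_reg(F, d):
    """sum_v d_v sum_u F[u, v]^2; returns (value, gradient)."""
    return float(np.sum(d * F**2)), 2.0 * d * F
```

One Majorization-Minimization pass then amounts to computing `d = mm_weights(F_k)` and running the differentiable solver with `weighted_frobenius_reg(F, d)` as the regularizer. Up to the constant $\sum_v \|F^{(k)}_{\cdot,v}\| / 2$, the weighted term upper-bounds the mixed norm and matches it at `F_k`.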

3.3 Online and Viterbi decoding 

In this section, we discuss the decoding complexity of our method in two cases: when
using only the sample classification score for decision and when using an offline Viterbi
decoding of the complete sequence.

First we discuss the online decoding complexity. The multi-class case is handled by a
One-Against-One strategy. So in order to decide the label of a given sample, the score
for each class has to be computed with the decision function (5), which is $O(n_s)$ with $n_s$
the number of support vectors. Finally the decoding of a sequence of size $n$ is $O(n_s.c.n)$
with $c$ the number of classes.

The offline Viterbi decoding relies on the work of Ganapathiraju et al. (2004), who
proposed to transform the output of SVM classifiers into probabilities with a sigmoid
function (Lin et al., 2007). The estimated probability for class $k$ is:

$$P(y = k \mid x) = \frac{1}{1 + \exp(A.g_k(x) + B)} \tag{14}$$

where $g_k$ is the One-Against-All decision function for class $k$ and $x$ the observed
sample. The $A$ and $B$ coefficients are learned by maximizing the log-likelihood on a
validation set. The inter-class transition matrix $M$ is estimated on the learning set. Finally
the Viterbi algorithm is used to obtain the maximum likelihood sequence. The
complexity for a sequence of size $n$ is then $O(n_s.c.n)$ to obtain the pseudo-probabilities
and $O(n.c^2)$ to decode the sequence.
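The offline decoding step can be sketched as a sigmoid mapping of the scores followed by a standard Viterbi pass in the log domain. This is a minimal illustration under stated assumptions. The function names are hypothetical, the pseudo-probabilities are used directly as per-sample emission scores, and the initial class distribution is uniform unless a `prior` is given. The values of `A`, `B` and `M` are assumed to have been estimated beforehand.

```python
import numpy as np

def platt_probabilities(scores, A, B):
    """Pseudo-probabilities 1 / (1 + exp(A * g_k(x) + B)) for an (n x c) score array."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(scores, dtype=float) + B))

def viterbi(probs, M, prior=None):
    """Most likely label sequence for per-sample class scores probs (n x c)
    and a transition matrix M[a, b] = P(next = b | current = a)."""
    probs = np.asarray(probs, dtype=float)
    n, c = probs.shape
    tiny = 1e-300                                  # avoids log(0)
    log_p = np.log(np.maximum(probs, tiny))
    log_M = np.log(np.maximum(np.asarray(M, dtype=float), tiny))
    prior = np.full(c, 1.0 / c) if prior is None else np.asarray(prior, dtype=float)
    delta = np.log(np.maximum(prior, tiny)) + log_p[0]
    back = np.zeros((n, c), dtype=int)
    for t in range(1, n):
        cand = delta[:, None] + log_M              # cand[a, b]: best path ending in a, then a -> b
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(c)] + log_p[t]
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n - 1, 0, -1):                  # backtrack from the last sample
        path[t - 1] = back[t, path[t]]
    return path
```

The loop over samples costs $O(c^2)$ per step, which matches the $O(n.c^2)$ decoding cost stated above. With a transition matrix that strongly favors staying in the same class, isolated sample-wise errors tend to be smoothed out.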

3.4 Related works 

To the best of our knowledge, there have been few works dealing with the joint learning
of a temporal filter and a decision function. The first one addressing such a problem is
our work (Flamary et al., 2010), which solves the problem for linear decision functions.
Here, we have extended this approach to the non-linear case and we have also
investigated the utility of different regularizers on the filter coefficients. Notably, we have
introduced regularizers that help in performing channel selection.

Works on Common Sparse Spatio-Spectral Patterns (Dornhege et al., 2006) are
probably those that are the most similar to ours. Indeed, they want to learn a linear
combination of channels and samples that optimizes a separability criterion. But the criteria
optimized by the two algorithms are different: CSSSP aims at maximizing the variance
of the samples for the positive class while minimizing the variance for the negative
class, whereas KF-SVM aims at maximizing the margin between classes in the feature
space. Furthermore, CSSSP is a feature extraction algorithm that is independent of the
classifier used, whereas in our case we learn a filter that is tailored to the (non-linear)
classification algorithm criterion. Furthermore, the filter used in KF-SVM is not
restricted to signal time samples but can also be applied to complex sequential features
extracted from the signal (e.g. PSD). An application of this latter statement is provided
in the experimental section.

KF-SVM can also be seen as a kernel learning method. Indeed the filter coefficients
can be interpreted as kernel parameters despite the fact that samples are non-iid.
Learning such kernel parameters is now a common approach introduced by Chapelle et al.
(2002). While Chapelle et al. minimize a bound on the generalization error by gradient
descent, in our case we simply minimize the SVM objective function and the influence on
the parameters differs. More precisely, if we focus on the columns of $F$ we see that the
coefficients of these columns act as a scaling of the channels. For a filter of size 1,
our approach would correspond to adaptive scaling as proposed by Grandvalet & Canu
(2003). In their work, they jointly learn the classifier and the Gaussian kernel
parameter $\sigma_k$ with a sparsity constraint on the dimensions of $\sigma_k$ leading to automated feature
selection. KF-SVM can thus be seen as a generalization of their approach which takes
into account the sequentiality of the samples.



4 Numerical experiments 
4.1 Toy Example 

In this section we present the toy example used for numerical experiments. Then we
discuss the performances and the parameter sensitivity of our method.

Figure 1 - Toy example for $\sigma_n = 0.5$, lag = 0. The plots on the left show the
evolution of both channels and labels along time; the right plot shows the non-linear problem
by projecting the samples on the channels.

Figure 2 - Bivariate histograms for a non-filtered ($\sigma_n = 1$, lag = 5) and KF-SVM
filtered signal (left for class 1 and right for class 2). (a) Without filtering (err=0.404).
(b) With KF-SVM filtering (err=0.044).

We use a toy example that consists of a 2D non-linear problem which can be seen
on Figure 1. Each class contains 2 modes, (-1,-1) and (1,1) for class 1 and (-1,1)
and (1,-1) for class 2, and their value is corrupted by a Gaussian noise of standard deviation
$\sigma_n$. Moreover, the length of the regions with constant label follows a uniform
distribution between [30, 40] samples. A time-lag drawn from a uniform distribution between
[-lag, lag] is applied to the channels leading to mislabeled samples in the learning and
test set.
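A signal with the properties described above could be generated along the following lines. Several details are not specified in the text and are assumptions of this sketch: classes alternate from one region to the next, one mode is drawn at random per region, class 1 is coded as +1 and class 2 as -1, the lag is an integer drawn once per channel, and it is applied as a circular shift.

```python
import numpy as np

def make_toy_signal(n, sigma_n=1.0, lag=5, seed=0):
    """Return (X, y): a 2-channel XOR-like signal of n samples and its labels."""
    rng = np.random.default_rng(seed)
    centers = {1: [(-1, -1), (1, 1)],      # class 1 modes (label +1)
               -1: [(-1, 1), (1, -1)]}     # class 2 modes (label -1)
    y = np.empty(n)
    modes = np.empty((n, 2))
    pos, label = 0, 1
    while pos < n:
        length = int(rng.integers(30, 41))             # region length in [30, 40]
        y[pos:pos + length] = label
        modes[pos:pos + length] = centers[label][rng.integers(0, 2)]
        pos += length
        label = -label                                 # assumption: classes alternate
    X = modes + sigma_n * rng.normal(size=(n, 2))      # additive Gaussian noise
    for v in range(2):
        shift = int(rng.integers(-lag, lag + 1))       # one lag per channel
        X[:, v] = np.roll(X[:, v], shift)              # circular shift: a simplification
    return X, y
```

With `sigma_n=0` and `lag=0` the product of the two channels equals the label at every sample, which makes the XOR-like, non-linearly separable structure of the problem explicit.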

We illustrate the behavior of the large margin filtering on a simple example ($\sigma_n =
1$, lag = 5). The bivariate histogram of the projection of the samples on the channels
can be seen on Figure 2. We can see on Figure 2(a) that due to the noise and time-lag
there is an important overlap between the bivariate histograms of both classes, but when
the large margin filter is applied, the classes are better separated (Figure 2(b)) and the
overlap is reduced, leading to a better classification rate (4% error vs 40%).

SVM, Avg-SVM (signal filtered by average filter), KF-SVM and SKF-SVM are
compared with and without Viterbi decoding. In order to test high dimensional problems,
some channels containing only Gaussian noise are added to the 2 discriminative ones,
leading to a toy signal of nbtot channels. The size of the signal is 1000 samples for
the learning and the validation sets and 10000 samples for the test set. In order to
compare fairly with Avg-SVM, we selected $f = 11$ and $n_0 = 6$, corresponding to a
good average filtering centered on the current sample. The regularization parameters
are selected by a validation method. All the processes are run ten times; the test error is
then the average over the runs.

Figure 3 - Test error for different problem size and noise for the toy example (plain
lines: sample classification, dashed lines: Viterbi decoding). (a) Varying noise value $\sigma_n$.
(b) Varying size nbtot of the problem.

We can see in Figure 3 the test error for different noise values $\sigma_n$ and problem sizes
nbtot. Both proposed methods outperform SVM and Avg-SVM with a Wilcoxon
signed-rank test p-value < 0.01. Note that results obtained with KF-SVM without Viterbi
decoding are even better than those observed with SVM and Viterbi decoding. This is
probably because, as we said previously, HMMs cannot adapt to time-lags because the learned
density estimations are biased. Surprisingly, the use of the sparse regularization does not
statistically improve the results despite the intrinsic sparsity of the problem. This comes
from the fact that the learned filters of both methods are sparse due to a numerical
precision thresholding for KF-SVM with the Frobenius regularizer. Indeed the $\lambda$ coefficient
selected by the validation is large, leading to a shrinkage of the non-discriminative
channels.

We discuss the importance of the choice of our model parameters. In fact KF-SVM
has 4 important parameters that have to be tuned: $\sigma_k$, $C$, $\lambda$ and $f$. Those parameters
have to be tuned in order to fit the problem at hand. Note that $\sigma_k$ and $C$ are
parameters linked to the SVM approach and that the remaining ones are due to the filtering
approach. In the results presented below, a validation has been done to select $\lambda$ and $C$.




Figure 4 - Test error for different parameters on the toy example (nbtot=10, $\sigma_n$=1;
plain lines: sample classification, dashed lines: Viterbi decoding). (a) Varying $f$.
(b) Varying $\sigma_k$.



We can see on the left of Figure 4 the performances of the different models for a
varying $f$. Note that $f$ has a big impact on the performances when using Avg-SVM. On
the contrary, KF-SVM shows good performances for a sufficiently long filter, due to the
learning of the filtering. Our approach is then far less sensitive to the size of the filter
than Avg-SVM. Finally we discuss the sensitivity to the kernel parameter $\sigma_k$. Test errors
for different values of this parameter are shown on Figure 4 (right). It is interesting to
note that KF-SVM is far less sensitive to this parameter than the other methods, simply
because the learning of the filtering corresponds to an automated scaling of the channels,
which means that if $\sigma_k$ is small enough the scaling of the channels will be done
automatically. In conclusion to these results, we can say that despite the fact that our
method has more parameters to tune than a simple SVM approach, it is far less sensitive
to two of these parameters than SVM.

Table 1 - Test Error for BCI Dataset

Method            Sub 1     Sub 2     Sub 3     Avg
BCI Comp.         0.2040    0.2969    0.4398    0.3135
SVM               0.2368    0.4207    0.5265    0.3947
KF-SVM, f = 2     0.2140    0.3732    0.4978    0.3617
KF-SVM, f = 5     0.1840    0.3444    0.4677    0.3320
KF-SVM, f = 10    0.1598    0.2450    0.4562    0.2870



4.2 BCI Dataset 



We test our method on the BCI Dataset from BCI Competition III (Blankertz et al.,
2004). The problem is to obtain a sequence of labels out of brain activity signals for 3
human subjects. The data consists of 96 channels containing PSD features (3 training
sessions, 1 test session, $n \approx 3000$ per session) and the problem has 3 labels (left arm,
right arm or a word).

For computational reasons, we decided to decimate the signal by 5, averaging
the samples. We focus on online sample labeling ($n_0 = 0$) and we test KF-SVM
for filter lengths $f$ corresponding to those used in (Flamary et al., 2010). The
regularization parameters are tuned using a grid search validation method on the third training set.
Our method is compared to the best BCI competition results and to the SVM without
filtering. Test errors for different filter sizes $f$ can be seen on Table 1. We can see that
we improve on the BCI Competition results by using longer filtering. We obtain
results similar to those reported in Flamary et al. (2010), though slightly worse. This probably
comes from the fact that the features used in this dataset are PSD and are known to
work well in the linear case. But we still obtain competitive results, which is promising
in the case of non-linear features.



5 Conclusions 



We have proposed a framework for learning large-margin filtering for non-linear
multi-channel sample labeling. Depending on the regularization term used, we can do
either an adaptive scaling of the channels or a channel selection. We proposed a
conjugate gradient algorithm to solve the minimization problem, and empirical results showed
that despite the non-convexity of the objective function our approach performs better
than classical SVM methods. We tested our approach on a non-linear toy example and
on a real-life BCI dataset and we showed that the sample classification rate and the precision
after Viterbi decoding can be drastically improved. Furthermore we studied the sensitivity
of our method to the regularization parameters.

In future work, we will study the use of prior information on the classification task.
For instance, when we know that the noise is in high frequencies, we could force
the filtering to be a low-pass filter. In addition, we will address the problem of
computational learning complexity, as our approach is not suitable for large-scale problems.